[python] support read after data evolution updating by shard #7157
JingsongLi merged 20 commits into apache:master from
Conversation
    row_tracking_enabled: bool,
-   system_fields: dict):
+   system_fields: dict,
+   requested_field_names: Optional[List[str]] = None):
You should just use `fields: List[DataField]`?
Updated
"""Ensure _ROW_ID and _SEQUENCE_NUMBER are not null (per SpecialFields)."""
fields = []
for field in schema:
    if field.name == SpecialFields.ROW_ID.name or field.name == SpecialFields.SEQUENCE_NUMBER.name:
Why can it be nullable?
A bug here: the nullable info of the row-tracking system fields is lost during `_assign_row_tracking`. Opened a separate PR #7174 to fix it.
same order for that shard.
- **Parallelism**: run multiple shards by calling `new_shard_updator(shard_idx, num_shards)` for each shard.

## Read After Partial Shard Update
I feel like this document doesn't make much sense
Removed
).slice(0, min_rows)
columns.append(column)
else:
    columns.append(pa.nulls(min_rows, type=self.schema.field(i).type))
This work should be done in DataFileBatchReader?
else:
    field = self.schema_map.get(name)
    inter_arrays.append(
        pa.nulls(num_rows, type=field.type) if field is not None else pa.nulls(num_rows)
I don't get it; FormatPyArrowReader has already handled read_fields.
JingsongLi left a comment
I ran your test and tried to fix it. All I needed to do was modify FormatPyArrowReader's `out_fields.append(pa.field(field_name, pa.null(), nullable=True))`: do not pass the null type; pass the correct type to fix it.
My bad, fixed by updating
JingsongLi left a comment
Thanks @XiaoHongbo-Hope ! Looks good to me.
Problem
When the user updates a column for only one shard (e.g. `ShardTableUpdator` runs shard 0 only and writes a new column `d`), a full table read fails: only that shard's files have the new column; other files do not. Concatenating batches then hits a schema mismatch and crashes. To fix the issue, we support data evolution shard read.
Tests
API and Format
Documentation